[57] ABSTRACT

A method and system are provided for processing records 
from a set of records, where records are repeatedly being 
added to the set of records, and where each record in the set 
of records has to be processed once for each of a plurality of 
entities. According to the method, each record that is added 
to the set of records is marked with a default batch value. For
each entity of the plurality of entities, a batch of the records 
is processed by performing the steps of: reading a last batch 
value associated with the entity, processing the records in the 
set of records that are marked with batch values that are 
more recent than the last batch value associated with the 
entity, and updating the last batch value associated with the 
entity to a most recent batch value of the records processed 
for the entity. Between processing consecutive batches for 
an entity of the plurality of entities, the set of records are 
marked by performing the steps of: updating a batch counter 
value to reflect a more recent batch number; and marking all 
records in the set of records that have the default batch value 
with the batch counter value. 

27 Claims, 9 Drawing Sheets 
DEQUEUING USING QUEUE BATCH
NUMBERS 

RELATED APPLICATIONS 

The present Application is related to the following
Applications: U.S. patent application Ser. No. 08/770,573, entitled
"Parallel Queue Propagation," filed by Alan Demers, James
Stamos, Sandeep Jain, Brian Oki, and Roger J. Bamford on
Dec. 19, 1996; and U.S. patent application Ser. No. 08/772,003,
entitled "Recoverable Replication Without Distributed
Transactions," filed by Alan Demers and Sandeep Jain on
Dec. 19, 1996 and now U.S. Pat. No. 5,781,912.

FIELD OF THE INVENTION

The present invention relates to database systems, and 
more particularly to techniques for propagating changes 
from one site to another. 

BACKGROUND OF THE INVENTION 

Under certain conditions, it is desirable to store copies of 
a particular set of data, such as a relational table, at multiple 
sites. If users are allowed to update the set of data at one site, 
the updates must be propagated to the copies at the other 
sites in order for the copies to remain consistent. The process
of propagating the changes is generally referred to as 
replication. 

Various mechanisms have been developed for performing 
replication. One such mechanism is described in U.S.
patent application Ser. No. 08/126,586 entitled "Method and 
Apparatus for Data Replication", filed on Sep. 24, 1993 by 
Sandeep Jain and Dean Daniels, the contents of which are 
incorporated by reference. 

The site at which a change is initially made to a set of 
replicated data is referred to herein as the source site. The
sites to which the change must be propagated are referred to 
herein as destination sites. If a user is allowed to make 
changes to copies of a particular table that are at different 
sites, those sites are source sites with respect to the changes 
initially made to their copy of the table, and destination sites 
with respect to the changes initially made to copies of the 
table at other sites. 

Replication does not require an entire transaction that is 
executed at a source site to be re-executed at each of the
destination sites. Only the changes made by the transaction 
to replicated data need to be propagated. Thus, other types 
of operations, such as read and sort operations, that may 
have been executed in the original transaction do not have to 
be re-executed at the destination sites. 

Row-level replication and column-level replication con- 
stitute two distinct styles of replication. In row-level or 
column-level replication, the updates performed by an 
executing transaction are recorded in a deferred transaction 
queue. The information recorded in the deferred transaction
queue includes both the old and the new values for each data
item that was updated. Row-level and column-level
replication differ with respect to whether old and new values are
transmitted for an entire relational row (row-level) or for 
only a subset of its columns (column-level). 

The changes recorded in the deferred transaction queue 
are propagated to the destination site. The destination site 
first checks that its current data values agree with the 
transmitted "old" values. The check may fail, for example, 
if concurrent changes have been made to the same replicated
data at different sites. If the check fails, a conflict is said to
have been detected. Various techniques may be used to
resolve such conflicts. If no conflict is detected, the current
data values at the destination site are replaced with the 
transmitted "new" values. 
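
The old-value check described above can be sketched as follows. This is only an illustrative sketch: the `Update` type, the `apply_update` function, and the dictionary standing in for a destination copy are hypothetical names introduced here, not part of the described system.

```python
from dataclasses import dataclass


@dataclass
class Update:
    key: str      # identifies the replicated data item (hypothetical)
    old: object   # value the source site saw before the update
    new: object   # value the source site wrote


def apply_update(copy: dict, update: Update) -> bool:
    """Apply one propagated update; return False if a conflict is detected."""
    if copy.get(update.key) != update.old:
        # The destination's current value disagrees with the transmitted
        # "old" value, so a conflict is detected and nothing is replaced.
        return False
    copy[update.key] = update.new
    return True


destination = {"row1": 10}
ok = apply_update(destination, Update("row1", old=10, new=11))
# A second update that still expects the old value 10 now conflicts.
conflict = apply_update(destination, Update("row1", old=10, new=12))
```

How a detected conflict is resolved is left open here, as it is in the text.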

Referring to FIG. 1, it illustrates a system in which copies
of a table 118 are stored at multiple sites. Specifically, the 
system includes three sites 100, 102 and 104. Sites 100, 102 
and 104 include disks 106, 108 and 110 that store copies 
120, 122 and 124 of table 118, respectively. Database servers 
130, 132 and 134 are executing at sites 100, 102 and 104, 
respectively. 

Assume that database server 130 executes a transaction 
that makes changes to copy 120. When execution of the 
transaction is successfully completed at site 100, a record of
the changes made by the transaction is stored in a deferred 
transaction queue 160 of a replication mechanism 140. Such 
records are referred to herein as deferred transaction records. 
Typically, the deferred transaction queue 160 will be stored 
on a non-volatile storage device so that the information 
contained therein can be recovered after a failure. 

Replication mechanism 140 includes a dequeue process 
for each of sites 102 and 104. Dequeue process 150 peri- 
odically dequeues all deferred transaction records that (1) 
involve changes that must be propagated to site 102, and (2) 
that dequeue process 150 has not previously dequeued. The 
records dequeued by dequeue process 150 are transmitted in 
a stream to site 102. The database server 132 at site 102 
makes the changes to copy 122 of table 118 after checking 
to verify that the current values in copy 122 match the "old 
values" contained in the deferred transaction records. 

Similarly, dequeue process 152 periodically dequeues all 
deferred transaction records that (1) involve changes that 
must be propagated to site 104, and (2) that dequeue process 
152 has not previously dequeued. The records dequeued by 
dequeue process 152 are transmitted in a stream to site 104. 
The database server 134 at site 104 makes the changes to 
copy 124 of table 118 after checking to verify that the 
current values in copy 124 match the "old values" contained 
in the deferred transaction records. 

Various obstacles may impede the efficiency of the
replication mechanism 140 illustrated in FIG. 1. For example,
a mechanism must be provided which allows dequeue 
processes 150 and 152 to distinguish between the deferred 
transaction records within deferred transaction queue 160 
that they have already dequeued, and the deferred transac- 
tion records that they have not yet dequeued. 

Further, a single stream connects dequeue processes 150 
and 152 to their corresponding destination sites. Efficiency 
may be improved by establishing multiple streams between 
the source site and each of the destination sites. However, 
there are constraints on the order in which updates must be 
applied at the destination sites, and the replication mecha- 
nism has no control over the order in which commands that 
are sent over one stream are applied at a destination site
relative to commands that are sent over a different stream. 
Therefore, a transmission scheduling mechanism must be 
provided if commands are to be sent to a destination site 
over more than one stream. 

Currently, database systems implement replication by 
executing deferred transactions using two phase commit 
techniques. During two phase commit operations, numerous 
messages are sent between the source site and each of the 
destination sites for each transaction to ensure that changes 
at all sites are made permanent as an atomic event. While the 
use of two phase commit techniques ensures that the various 
databases may be accurately recovered after a failure, the 
overhead involved in the numerous inter-site messages is 
significant. Therefore, it is desirable to provide a mechanism
that involves less messaging overhead than two phase commit
techniques but which still allows accurate recovery after
a failure.

SUMMARY OF THE INVENTION

A method and system are provided for processing records
from a set of records, where records are repeatedly being
added to the set of records, and where each record in the set
of records has to be processed once for each of a plurality of
entities. According to the method, each record that is added
to the set of records is marked with a default batch value.
For each entity of the plurality of entities, a batch of the
records is processed by performing the steps of: reading a
last batch value associated with the entity, processing the
records in the set of records that are marked with batch
values that are more recent than the last batch value
associated with the entity, and updating the last batch value
associated with the entity to a most recent batch value of the
records processed for the entity.

Between processing consecutive batches for an entity of
the plurality of entities, the set of records are marked by
performing the steps of: updating a batch counter value to
reflect a more recent batch number; and marking all records
in the set of records that have the default batch value with
the batch counter value.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example,
and not by way of limitation, in the figures of the
accompanying drawings and in which like reference numerals
refer to similar elements and in which:

FIG. 1 is a block diagram of a computer system that
includes a replication mechanism;

FIG. 2 is a block diagram of a computer system that may
be used to implement the present invention;

FIG. 3A is a block diagram of a portion of a replication
system in which queue batch numbers are used to coordinate
dequeuing operations according to an embodiment of the
invention;

FIG. 3B illustrates the system of FIG. 3A after a stamping
operation is performed;

FIG. 3C illustrates the system of FIG. 3B after a dequeuing
operation is performed;

FIG. 3D illustrates the system of FIG. 3C after another
stamping operation is performed;

FIG. 4 is a block diagram that illustrates propagation
mechanisms that propagate transactions using multiple
streams per destination site according to an embodiment of
the invention;

FIG. 5 is a flow chart illustrating the steps used to
schedule the transmission of transactions according to an
embodiment of the invention; and

FIG. 6 is a block diagram of a replication system in which
the destination site maintains an applied transaction table
that may be used in recovery after a failure, according to an
embodiment of the invention.

A method and apparatus for replicating data at multiple
sites is described. In the following description, for the
purposes of explanation, numerous specific details are set
forth in order to provide a thorough understanding of the
present invention. It will be apparent, however, to one
skilled in the art that the present invention may be practiced
without these specific details. In other instances, well-known
structures and devices are shown in block diagram form in
order to avoid unnecessarily obscuring the present invention.

Hardware Overview

Referring to FIG. 2, it is a block diagram of a computer
system 200 upon which an embodiment of the present
invention can be implemented. Computer system 200
includes a bus 201 or other communication mechanism for
communicating information, and a processor 202 coupled
with bus 201 for processing information. Computer system
200 further comprises a random access memory (RAM) or
other dynamic storage device 204 (referred to as main
memory), coupled to bus 201 for storing information and
instructions to be executed by processor 202. Main memory
204 may also be used for storing temporary variables or
other intermediate information during execution of
instructions by processor 202. Computer system 200 also
comprises a read only memory (ROM) and/or other static storage
device 206 coupled to bus 201 for storing static information
and instructions for processor 202. Data storage device 207
is coupled to bus 201 for storing information and
instructions.

A data storage device 207 such as a magnetic disk or
optical disk and its corresponding disk drive can be coupled
to computer system 200. Computer system 200 can also be
coupled via bus 201 to a display device 221, such as a
cathode ray tube (CRT), for displaying information to a
computer user. Computer system 200 further includes a
keyboard 222 and a cursor control device 223, such as a
mouse.

The present invention is related to the use of computer
system 200 to propagate to other sites changes made to data
on disk 207. According to one embodiment, replication is
performed by computer system 200 in response to processor
202 executing sequences of instructions contained in
memory 204. Such instructions may be read into memory
204 from another computer-readable medium, such as data
storage device 207. Execution of the sequences of
instructions contained in memory 204 causes processor 202 to
perform the process steps that will be described hereafter. In
alternative embodiments, hard-wired circuitry may be used
in place of or in combination with software instructions to
implement the present invention. Thus, the present invention
is not limited to any specific combination of hardware
circuitry and software.

Dequeuing Techniques

As mentioned above, one phase of the replication process
involves placing deferred transaction records into a deferred
transaction queue. According to one embodiment, the
deferred transaction queue is implemented as a relational
table, where each deferred transaction record is stored as one
or more rows within the table.

For example, a transaction record for a given transaction
may consist of ten rows within the deferred transaction
queue, where each of the ten rows corresponds to an update
performed by the transaction and contains an old and new
value for the update and an update sequence number that
identifies the order in which the update was performed
relative to the other updates performed by the transaction.
The transaction record also contains a transaction identifier
that identifies the transaction and a "prepared time" value
that indicates when the transaction finished execution (was
"prepared") relative to other transactions. The transaction 
identifier and the prepared time value of a transaction may 
be stored, for example, in one of the rows that constitute the 
transaction record for the transaction. 

The process of dequeuing a deferred transaction record 
for one destination site does not automatically remove the 
deferred transaction record from the deferred transaction 
queue because the deferred transaction record may have to 
be dequeued for other destination sites. Once a deferred 
transaction record has been dequeued for all destination 
sites, the deferred transaction record may be removed from 
the deferred transaction queue by a process that may be 
entirely independent of the dequeuing processes. 

For example, in a replication environment consisting of N 
sites, each deferred transaction record must be dequeued 
N-1 times (once for each destination site) before it can be 
deleted from the deferred transaction queue. Because the act 
of dequeuing a deferred transaction record does not remove 
the deferred transaction record from the deferred transaction 
queue, the presence of a deferred transaction record within 
the deferred transaction queue does not indicate whether the 
deferred transaction record has been dequeued for any given 
destination site. 

For each destination site, a dequeuing process repeatedly 
performs a dequeuing operation on the deferred transaction 
queue. During every dequeuing operation the dequeuing 
process performs, it must only dequeue the deferred trans- 
action records for its destination site that it has not already 
dequeued. Therefore, a mechanism must be provided for
determining which deferred transaction records within the 
deferred transaction queue have already been dequeued for 
each of the destination sites. 

The Prepare Sequence Approach 

One way to keep track of which deferred transaction 
records have been dequeued for each destination site 
involves storing within each deferred transaction record a 
sequence number that indicates the sequence in which the 
transaction associated with the deferred transaction record 
was made permanent ("committed") at the source site. Each 
dequeuing process then keeps track of the highest sequence 
number of the records that it has dequeued. At each subse- 
quent pass, the dequeuing process only reads those records 
with higher sequence numbers than the highest sequence 
number encountered on the previous pass. 

When the deferred transaction queue is implemented 
using a relational table, the process of dequeuing records 
from the deferred transaction queue may be implemented by 
executing a query on the table. To implement the prepare 
sequence approach described above, a dequeuing process 
would repeatedly execute the equivalent of the SQL query: 

select * from queue_table where sequence_number >
last_sequence_number order by sequence_number;

Generally, a transaction is not considered committed until 
a deferred transaction record for the transaction is written 
into the deferred transaction queue. Therefore, a commit 
time cannot be assigned to a transaction until the deferred 
transaction record is written into the deferred transaction 
queue. Consequently, the deferred transaction record that is 
written into the deferred transaction queue does not contain 
the true commit time of the corresponding transaction. In 
place of the commit time, a "prepared time value" is stored 
in the transaction record. Prepared time values indicate the 
time at which transactions completed execution, not the
actual time the transactions committed. 
Because the transaction records do not contain actual
commit times, the prepared time values are used as sequence
numbers for the dequeuing technique described above.
However, the database system is not able to guarantee that
the deferred transaction records of isolated transactions will
commit in the order in which the transactions acquire
prepare times. Without such a guarantee, deferred transaction
records may be written into the deferred transaction
queue out of prepare sequence.

The possibility that deferred transaction records may be
written into the deferred transaction queue out of prepare
sequence renders the prepare sequence approach unusable.
For example, assume that two transactions with sequence
numbers S1<S2 are inserted into the deferred transaction
queue out of order. If a dequeue process performs a dequeue
operation after the S2 deferred transaction record is inserted
and before the S1 deferred transaction record is inserted,
then the highest sequence number seen by the dequeue
process will be at least S2. When the dequeue process
performs a subsequent dequeue operation, the dequeue process
will only dequeue deferred transaction records that have
sequence numbers greater than S2. The S1 deferred transaction
record will be skipped and may never be dequeued by
that dequeue process.
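
The skipped-record scenario above can be reproduced with a small sketch. The list standing in for the deferred transaction queue, the function name `dequeue_by_sequence`, and the concrete values chosen for S1 and S2 are assumptions made only for illustration.

```python
def dequeue_by_sequence(queue, last_seq):
    """Return the records with sequence numbers above last_seq, in order,
    together with the new highest sequence number seen."""
    batch = sorted(seq for seq in queue if seq > last_seq)
    return batch, (batch[-1] if batch else last_seq)


S1, S2 = 1, 2          # S1 < S2, as in the example above
queue = []             # stands in for the deferred transaction queue
last_seq = 0

queue.append(S2)       # the S2 record is inserted first (out of order)
first_pass, last_seq = dequeue_by_sequence(queue, last_seq)

queue.append(S1)       # the S1 record is inserted after the first pass
second_pass, last_seq = dequeue_by_sequence(queue, last_seq)

# S1 is in the queue but was never returned by either pass.
skipped = [seq for seq in queue if seq not in first_pass + second_pass]
```

After the second pass, `skipped` holds the S1 record: it sits in the queue below the high-water mark and no later pass will select it.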

Sequence Stamp Locking 

One approach to avoid the out-of-sequence problem
associated with the prepare sequence approach is to prevent
transactions from acquiring prepared time values until the
deferred transaction records for all transactions that have
previously acquired prepared time values are stored in the
deferred transaction queue. If transactions cannot acquire
prepared time values until the deferred transaction records
for all transactions that have previously acquired prepared
time values are stored in the deferred transaction queue, then
the commit time order will always reflect the prepared time
order. Thus, the prepared time may be treated as the commit
time.

For example, an "enqueue lock" may be used to restrict
access to the sequence assignment mechanism. Before a
transaction can be assigned a sequence number, the
transaction must acquire the enqueue lock. The transaction must
then hold the enqueue lock until the deferred transaction
record for the transaction is actually written to the deferred
transaction queue. This technique effectively makes the
sequence number assignment and the insertion of the
deferred transaction record an atomic operation. The
following steps could be used to implement this technique:

    begin transaction
    perform transaction operations
    acquire enqueue lock
    acquire sequence number
    insert deferred transaction record into deferred transaction queue
    commit and release enqueue lock

While this technique avoids the out-of-sequence problems
associated with the prepare sequence approach, it also
creates a bottleneck in transaction processing. Specifically,
when numerous concurrent processes complete execution at
the same time, one will acquire the enqueue lock and the
others will have to await their turn. Thus, while the
transactions may be executing in parallel to take full advantage
of the processing power of the hardware on which they are
executing, they will have to be processed serially upon
completion.
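
A minimal sketch of the enqueue lock steps is given below, using threads in place of database transactions. The names `enqueue_lock`, `sequence`, `deferred_queue` and `run_transaction` are hypothetical, and an in-memory list stands in for the deferred transaction queue.

```python
import itertools
import threading

enqueue_lock = threading.Lock()
sequence = itertools.count(1)   # sequence assignment mechanism
deferred_queue = []             # stands in for the deferred transaction queue


def run_transaction(name):
    # begin transaction; transaction operations would run here, in parallel
    with enqueue_lock:                       # acquire enqueue lock
        seq = next(sequence)                 # acquire sequence number
        deferred_queue.append((seq, name))   # insert deferred transaction record
    # commit; leaving the "with" block releases the enqueue lock


threads = [threading.Thread(target=run_transaction, args=(f"T{i}",))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

sequences = [seq for seq, _ in deferred_queue]
```

Because assignment and insertion happen under one lock, the records appear in the queue in sequence order. The same lock is also the point at which otherwise parallel transactions are forced to take turns.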
To avoid the bottleneck associated with the stamp locking
process, a record can be maintained to indicate which 
deferred transaction records have been dequeued for which 
sites. For example, a plurality of flags may be stored in each 
deferred transaction record, where each flag corresponds to 
a destination site. Initially, all of the flags indicate that the 
deferred transaction record has not been dequeued. During 
each dequeue pass, the dequeue process inspects each 
deferred transaction record to determine whether the flag 
corresponding to the destination site associated with the 
dequeue process has been set. If the flag has been set, the 
deferred transaction record is skipped. If the flag has not 
been set, the dequeue process dequeues the deferred trans- 
action record. When a dequeue process dequeues the 
deferred transaction record, the dequeue process sets the flag 
within the deferred transaction record that corresponds to the 
destination site associated with the dequeue process to 
indicate that the deferred transaction record has been 
dequeued for that destination site. 

Unfortunately, the record flagging approach has the
disadvantage that each deferred transaction record will be
updated once for each destination site. This disadvantage is
significant because updates involve a relatively large amount
of overhead and there may be a large number of destination
sites.
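
The record flagging approach and its update cost can be sketched as follows. The site names, the dictionary layout of a record, and the `update_count` counter are assumptions introduced for illustration only.

```python
SITES = ("site_102", "site_104")   # hypothetical destination site names


def new_record(txn_id):
    # Initially every flag indicates "not yet dequeued".
    return {"txn_id": txn_id, "dequeued": {site: False for site in SITES}}


update_count = 0   # counts how many times a record is updated


def dequeue_pass(queue, site):
    """Dequeue every record whose flag for this site is not yet set."""
    global update_count
    dequeued = []
    for record in queue:
        if record["dequeued"][site]:
            continue                       # flag set: skip the record
        record["dequeued"][site] = True    # one update per record per site
        update_count += 1
        dequeued.append(record["txn_id"])
    return dequeued


queue = [new_record(txn_id) for txn_id in (1, 2, 3)]
first = dequeue_pass(queue, "site_102")
repeat = dequeue_pass(queue, "site_102")
other = dequeue_pass(queue, "site_104")
```

With three records and two destination sites, the sketch performs six record updates, which is the per-site update cost noted above.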

As an alternative to using flags within the deferred
transaction records, a record that indicates which deferred
transaction records have been dequeued for each destination
site may be maintained external to the deferred transaction
queue. For example, each dequeue process may maintain a
dequeued transactions table into which the dequeue process
inserts a row for each deferred transaction record that it
dequeues, where the row identifies the transaction associated
with the dequeued deferred transaction record.

However, the dequeued transaction table approach also
involves a significant amount of overhead. Specifically, a
row must be generated and inserted for each destination site
for every deferred transaction record. In addition, the
dequeue query is expensive in that the entire deferred
transaction queue may have to be scanned looking for
deferred transaction records that are not recorded in a
particular dequeued transaction table.
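
One way the dequeued transaction table approach might look is sketched below with an in-memory SQLite database. The table names `queue_table` and `dequeued_site_102` and the single `txn_id` column are hypothetical; a real deferred transaction queue would carry more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE queue_table (txn_id INTEGER PRIMARY KEY);
    CREATE TABLE dequeued_site_102 (txn_id INTEGER PRIMARY KEY);
""")


def enqueue(txn_id):
    conn.execute("INSERT INTO queue_table VALUES (?)", (txn_id,))


def dequeue_for_site_102():
    # The query looks through the queue for records that have no
    # matching row in this site's dequeued transactions table.
    rows = conn.execute(
        "SELECT txn_id FROM queue_table "
        "WHERE txn_id NOT IN (SELECT txn_id FROM dequeued_site_102) "
        "ORDER BY txn_id"
    ).fetchall()
    # One row is inserted per dequeued record, for this site alone.
    conn.executemany("INSERT INTO dequeued_site_102 VALUES (?)", rows)
    return [txn_id for (txn_id,) in rows]


for txn_id in (1, 2, 3):
    enqueue(txn_id)
first_pass = dequeue_for_site_102()
enqueue(4)
second_pass = dequeue_for_site_102()
```

Each destination site would need its own table and its own inserted rows, which is the overhead described above.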

Queue Batch Numbers 

According to an embodiment of the invention, a "queue
batch number" column is added to each deferred transaction
record in the deferred transaction queue. When a deferred
transaction record is initially inserted into the queue, the
queue batch value is set to some default value. Before
dequeuing deferred transaction records, each dequeue
process "stamps" the deferred transaction queue by setting the
queue batch values in all of the deferred transaction records
that have the default queue batch value to a queue batch
number that is greater than any queue batch number that has
previously been assigned to any deferred transaction record.
The dequeue process then dequeues all of the records that
have queue batch numbers greater than the queue batch
number used by that dequeue process in its previous batch
stamping operation.
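
The stamp-then-dequeue cycle can be sketched against a relational table as follows. This is an illustrative sketch only: the table and column names, the in-memory SQLite database, the module-level `batch_counter`, and its starting value of 0 are assumptions, and the default of -5000 is borrowed from the illustrated embodiment.

```python
import sqlite3

DEFAULT_BATCH = -5000   # default queue batch value (illustrated embodiment)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queue_table (txn_id INTEGER, queue_batch INTEGER)")
batch_counter = 0       # hypothetical starting value, above DEFAULT_BATCH


def enqueue(txn_id):
    # New records always carry the default queue batch value.
    conn.execute("INSERT INTO queue_table VALUES (?, ?)",
                 (txn_id, DEFAULT_BATCH))


def stamp_and_dequeue(last_batch):
    """Stamp unstamped records, then dequeue those newer than last_batch.

    Returns the dequeued transaction ids and the caller's new last batch
    value (the most recent batch value among the records processed).
    """
    global batch_counter
    batch_counter += 1
    conn.execute("UPDATE queue_table SET queue_batch = ? WHERE queue_batch = ?",
                 (batch_counter, DEFAULT_BATCH))
    rows = conn.execute(
        "SELECT txn_id, queue_batch FROM queue_table "
        "WHERE queue_batch > ? ORDER BY queue_batch, txn_id",
        (last_batch,)).fetchall()
    new_last = max((batch for _, batch in rows), default=last_batch)
    return [txn_id for txn_id, _ in rows], new_last


enqueue(1)
enqueue(2)
a_first, last_a = stamp_and_dequeue(0)        # dequeue process A
enqueue(3)
b_first, last_b = stamp_and_dequeue(0)        # dequeue process B
a_second, last_a = stamp_and_dequeue(last_a)  # A again: only the new record
```

Each record is updated by exactly one stamping operation, regardless of how many dequeue processes later read it.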

The queue batch number stamping technique is illustrated
in FIGS. 3A-3D. Referring to FIG. 3A, it illustrates an
embodiment of the invention in which a deferred transaction
queue 300 is implemented using a table. Deferred transaction
records 308 are inserted into deferred transaction queue
300 by a database server after the transactions are prepared
at the local (source) site. Prior to insertion into deferred
transaction queue 300, these deferred transaction records are
assigned the default queue batch value. In the illustrated
embodiment, the default queue batch value is -5000.

At the time illustrated in FIG. 3A, deferred transaction
records for five transactions have been inserted into the
deferred transaction queue 300. None of the transactions
have yet been dequeued by any dequeue process, and
therefore all still contain the default queue batch value.
Dequeue process 302 has previously dequeued deferred
transaction records with queue batch numbers up to 60, and
therefore stores the value "60" as its LAST_BATCH number.
Dequeue process 304 has previously dequeued deferred
transaction records with queue batch numbers up to 59, and
therefore stores the value "59" as its LAST_BATCH number.

Prior to performing a dequeue operation, dequeue process
304 performs a batch stamping operation on deferred
transaction queue 300. During the batch stamping operation, all
deferred transaction records within deferred transaction
queue 300 that currently hold the default queue batch
number (-5000) are stamped with a higher queue batch
number than has previously been assigned to any deferred
transaction records. To ensure that the new queue batch
number is higher than any previously assigned queue batch
number, a queue batch counter 306 is used to track the
highest previously assigned batch number. Initially, the
queue batch counter is set to a value that is greater than the
default queue batch number. At the time illustrated in FIG.
3A, the highest previously assigned queue batch value is 60.

FIG. 3B illustrates deferred transaction queue 300 after
dequeue process 304 has performed a batch stamping
operation. The queue batch counter 306 is incremented, increasing
the value of the counter to 61. The deferred transaction
records within deferred transaction queue 300 that
previously stored the default queue batch value of -5000 now
store the new queue batch value of 61. After the batch
stamping operation, dequeue process 304 dequeues all of the
deferred transaction records that have queue batch values
that are higher than the highest queue batch value previously
used by dequeue process 304. At the time illustrated in FIG.
3B, the LAST_BATCH value of dequeue process 304 is 59,
and the five deferred transaction records in deferred
transaction queue 300 have queue batch values of 61. Therefore,
dequeue process 304 will dequeue all five of the deferred
transaction records.

FIG. 3C illustrates deferred transaction queue 300 after
dequeue process 304 has performed a dequeue operation.
The LAST_BATCH value of dequeue process 304 has been
updated to reflect that dequeue process 304 has dequeued all
deferred transaction records with queue batch values up to
61.

At the time illustrated in FIG. 3C, five new deferred
transaction records have been inserted into deferred
transaction queue 300 since the batch stamping operation
performed by dequeue process 304. These new deferred
transaction records have been assigned the default queue batch
value. As long as the new deferred transaction records were
added after the batch stamping operation, the new deferred
transaction records will not have been dequeued by dequeue
process 304 regardless of whether they were inserted before
or after the dequeue operation because dequeue process 304
only dequeued those deferred transaction records with queue
batch values greater than 59.

Assume that at the time illustrated in FIG. 3C, dequeue
process 302 performs a batch stamping operation. Dequeue
process 302 increments the queue batch counter to 62, and
stamps all of the deferred transaction records that have the
default queue batch value with the new queue batch value of
62. FIG. 3D illustrates the state of deferred transaction
queue 300 after dequeue process 302 has performed such a
batch stamping operation. Dequeue process 302 may then
perform a dequeue operation in which dequeue process 302
dequeues all deferred transaction records with queue batch
values greater than 60. During the dequeue operation,
dequeue process 302 would dequeue all of the deferred
transaction records previously dequeued by dequeue process
304, as well as all of the new deferred transaction records.
After the dequeue operation, dequeue process 302 would
update its LAST_BATCH value to 62.

Assume that no new records arrive after the time
illustrated in FIG. 3D and the next batch stamping operation is
performed by dequeue process 304. Under these conditions, 
the queue batch counter 306 would be incremented to 63, but 
none of the deferred transaction records within deferred 
transaction queue 300 will be updated. Dequeue process 304 
would only dequeue those deferred transaction records with 
queue batch values greater than the LAST_BATCH value of 
dequeue process 304. In the illustrated example, the LAST_ 
BATCH value of dequeue process 304 is 61. Therefore, 
dequeue process 304 would only dequeue those deferred 
transaction records that it did not dequeue in its previous 
dequeue operation. 
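
The sequence of FIGS. 3A-3D can be traced with a small in-memory sketch. The `BatchQueue` class, its method names, and the transaction ids 1 through 10 are hypothetical; the counter value 60, the LAST_BATCH values 60 and 59, and the default of -5000 are taken from the illustrated example. Setting LAST_BATCH to the most recent batch value among the records processed is the rule assumed here.

```python
DEFAULT_BATCH = -5000


class BatchQueue:
    """In-memory stand-in for deferred transaction queue 300."""

    def __init__(self, counter):
        self.records = []        # [txn_id, queue_batch] pairs
        self.counter = counter   # queue batch counter 306

    def insert(self, txn_id):
        self.records.append([txn_id, DEFAULT_BATCH])

    def stamp(self):
        # Increment the counter and stamp every unstamped record with it.
        self.counter += 1
        for record in self.records:
            if record[1] == DEFAULT_BATCH:
                record[1] = self.counter

    def dequeue(self, last_batch):
        got = [r for r in self.records if r[1] > last_batch]
        new_last = max((r[1] for r in got), default=last_batch)
        return [r[0] for r in got], new_last


queue = BatchQueue(counter=60)
last_batch = {302: 60, 304: 59}

for txn_id in range(1, 6):            # FIG. 3A: five unstamped records
    queue.insert(txn_id)

queue.stamp()                         # FIG. 3B: process 304 stamps with 61
got_304, last_batch[304] = queue.dequeue(last_batch[304])   # FIG. 3C

for txn_id in range(6, 11):           # five new records arrive
    queue.insert(txn_id)

queue.stamp()                         # FIG. 3D: process 302 stamps with 62
got_302, last_batch[302] = queue.dequeue(last_batch[302])

queue.stamp()                         # process 304 again: counter reaches 63,
                                      # but no record is updated
again_304, last_batch[304] = queue.dequeue(last_batch[304])
```

In this trace, process 304 first receives the five original records, process 302 then receives all ten, and the final pass by process 304 returns only the five records it had not previously dequeued.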

By comparing LAST_BATCH numbers with queue batch
numbers, dequeue processes can quickly distinguish
between deferred transaction records they have already
dequeued, and deferred transaction records they have not yet
dequeued. Using this technique, many deferred transaction
records can be concurrently written into the deferred
transaction queue 300 out of prepared time order without
adversely affecting dequeue operations. Therefore, the
bottleneck associated with the sequence stamp locking
technique described above is avoided.

Further, each deferred transaction record is updated only
once, not once for every destination site. Specifically, each
deferred transaction record is updated only during the first
batch stamping operation performed after the record has been
inserted into the deferred transaction queue 300, at which
point it is stamped with a non-default queue batch number.
Therefore, this technique avoids the significant overhead
associated with the record flagging techniques described
above.
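
By way of illustration only, the batch stamping and dequeue operations described above might be modeled as in the following Python sketch. The names Record, DeferredQueue, stamp and dequeue, and the use of None as the default queue batch value, are assumptions made for this example and do not appear in the described embodiment.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

DEFAULT_BATCH = None  # assumed stand-in for the default queue batch value


@dataclass
class Record:
    txn_id: str
    batch: Optional[int] = DEFAULT_BATCH


@dataclass
class DeferredQueue:
    batch_counter: int
    records: List[Record] = field(default_factory=list)

    def enqueue(self, txn_id: str) -> None:
        # New records always arrive carrying the default batch value.
        self.records.append(Record(txn_id))

    def stamp(self) -> int:
        # Advance the counter, then stamp every record still at the default.
        self.batch_counter += 1
        for rec in self.records:
            if rec.batch is DEFAULT_BATCH:
                rec.batch = self.batch_counter
        return self.batch_counter

    def dequeue(self, last_batch: int) -> Tuple[List[Record], int]:
        # Select stamped records newer than last_batch; also return the
        # caller's new LAST_BATCH (unchanged if nothing was selected).
        hits = [r for r in self.records
                if r.batch is not DEFAULT_BATCH and r.batch > last_batch]
        new_last = max((r.batch for r in hits), default=last_batch)
        return hits, new_last


queue = DeferredQueue(batch_counter=60)
queue.enqueue("T1")
queue.stamp()                            # T1 is stamped with 61
first, last_a = queue.dequeue(60)        # one site dequeues T1
queue.enqueue("T2")
queue.stamp()                            # T2 is stamped with 62; T1 is untouched
second, last_a = queue.dequeue(last_a)   # the same site now sees only T2
```

In this sketch each record is written exactly once by stamp, regardless of how many sites later dequeue it, which mirrors the property noted above.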

Sequential Processing 

According to one embodiment, dequeued transactions are 
processed sequentially, not as atomic "batches" of transac- 
tions. The order in which a transaction is processed is based 
on both the batch number of the transaction and the prepared 
time of the transaction. Specifically, transactions are 
dequeued in <batch number, prepared time> order. Thus, for
each dequeue process, transactions with older batch numbers
are processed before transactions with newer batch
numbers. Within a batch, transactions with older prepared 
times are processed before transactions with newer prepared 
times. 

Because batches are not processed as atomic units, the 
LAST_BATCH value alone is not enough to indicate which 
transactions have and have not been processed by a particular
dequeuing process. According to one embodiment, a
<LAST_BATCH, transaction identifier> value pair is maintained
by each dequeue process to indicate the last transaction to
be processed by the dequeuing process. After a failure, the
<LAST_BATCH, transaction identifier> value pair for a dequeue
process may be used to determine which transactions must
still be processed by the dequeue process.
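
The following Python sketch is offered as one possible illustration of resuming sequential processing from a <LAST_BATCH, transaction identifier> pair. The names Txn, remaining and process_all are hypothetical, prepared times are modeled as integers, and the sketch assumes that the bookmarked transaction is still present in the queue.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class Txn(NamedTuple):
    txn_id: str
    batch: int
    prepared_time: int


Bookmark = Tuple[int, str]  # <LAST_BATCH, transaction identifier>


def remaining(queue: List[Txn], bookmark: Optional[Bookmark]) -> List[Txn]:
    """Return the transactions still to be processed, in
    <batch number, prepared time> order."""
    ordered = sorted(queue, key=lambda t: (t.batch, t.prepared_time))
    if bookmark is None:
        return ordered
    for pos, txn in enumerate(ordered):
        if (txn.batch, txn.txn_id) == bookmark:
            return ordered[pos + 1:]
    # Assumption of this sketch: the bookmarked record has not been purged.
    raise LookupError("bookmarked transaction not found in queue")


def process_all(queue: List[Txn], bookmark: Optional[Bookmark],
                apply: Callable[[Txn], None]) -> Optional[Bookmark]:
    """Process pending transactions one at a time, advancing the bookmark."""
    for txn in remaining(queue, bookmark):
        apply(txn)
        bookmark = (txn.batch, txn.txn_id)  # would be stored durably
    return bookmark


queue = [Txn("T1", 61, 5), Txn("T2", 61, 3), Txn("T3", 62, 1)]
seen: List[Txn] = []
final = process_all(queue, (61, "T2"), seen.append)
```

Here a dequeue process that failed after processing T2 resumes with T1 and then T3, rather than repeating batch 61 as a whole.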

Scheduling Batch Stamping Operations

In the embodiment described above, the dequeue processes
perform batch stamping operations before every dequeue query
they perform. However, a batch stamping operation does not
need to be performed before a dequeuing query for a given
site as long as a batch stamping operation has been performed
subsequent to the last dequeuing query for the given site.
Further, as long as at least one batch stamping operation is
performed between consecutive dequeue queries for a given
site, the actual number of batch stamping operations
performed between consecutive dequeue operations for a site
will not affect the dequeue query.

For example, at the time shown in FIG. 3C, dequeue 
process 302 can perform a dequeue query without first 
performing a batch stamping operation. This is possible 
because dequeue process 304 performed a batch stamping 
operation since the last dequeue query performed by 
dequeue process 302. Under these circumstances, the newly 
arrived deferred transaction records would not be dequeued 
by dequeue process 302 until a subsequent dequeue query is 
performed by dequeue process 302. The present invention is 
not limited to any particular mechanism for scheduling batch 
stamping operations relative to dequeue operations. 

In the embodiment described above, each destination site 
has a dequeue process and the dequeue processes perform 
the batch stamping operations. In alternative embodiments, 
each destination site may have more than one dequeue 
process, and each dequeue process may service more than 
one destination site. Further, batch stamping operations may 
be performed by one or more processes executing indepen- 
dent of the dequeue processes, or by recursive transactions 
initiated by the dequeue processes. 

Purging The Deferred Transaction Queue 

Once a deferred transaction record has been processed for
all destination sites to which it must be propagated, the
deferred transaction record can be deleted from the deferred
transaction queue. According to one embodiment, a process
responsible for purging the deferred transaction queue reads
the <LAST_BATCH, transaction-id> value pair for each of the
destination sites. The <LAST_BATCH, transaction-id> value
pair maintained by each dequeue process indicates the last
transaction encountered by that dequeue process.

Each dequeue process will maintain its own <LAST_BATCH,
transaction-id> value. Of all the transactions thus
identified, the transaction with the lowest <batch number,
prepared time> value represents the most recent transaction
that has been encountered by the dequeue processes for all
sites (the "global bookmark"). The purging process deletes
all deferred transaction records in the deferred transaction
queue for transactions that have lower <batch number,
prepared time> values than the global bookmark (except for
transactions currently marked with the default batch value),
since these deferred transaction records have been dequeued
for all destination sites for which they need to be dequeued.
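
As a rough illustration, the purge step might be sketched in Python as follows. The function name purge, the Record layout, the use of None for the default batch value, and the representation of each site's bookmark by its transaction identifier alone are all assumptions of this example.

```python
from typing import Dict, List, NamedTuple, Optional


class Record(NamedTuple):
    txn_id: str
    batch: Optional[int]   # None is an assumed stand-in for the default value
    prepared_time: int


def purge(queue: List[Record], bookmarks: Dict[str, str]) -> List[Record]:
    """Return the queue without the records that lie below the global bookmark.

    `bookmarks` maps each destination site to the transaction identifier
    held in that site's <LAST_BATCH, transaction-id> pair.
    """
    by_id = {r.txn_id: r for r in queue}
    keys = [(by_id[t].batch, by_id[t].prepared_time)
            for t in bookmarks.values()]
    global_bookmark = min(keys)   # lowest <batch number, prepared time>
    return [r for r in queue
            if r.batch is None    # records at the default value are never purged
            or (r.batch, r.prepared_time) >= global_bookmark]


queue = [Record("T1", 61, 1), Record("T2", 61, 2),
         Record("T3", 62, 1), Record("T4", None, 9)]
kept = purge(queue, {"site A": "T3", "site B": "T2"})
```

In this example the global bookmark is T2, so only T1 is deleted; T2 itself is retained because only records strictly below the bookmark are purged.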

A dequeue process may not dequeue some deferred transaction
records it encounters because the deferred transaction
records do not have to be propagated to the destination site
associated with the dequeue process. According to one
embodiment, the <LAST_BATCH, transaction-id> value for each
site is updated based on all deferred transaction records
encountered (but not necessarily dequeued) during the dequeue
operations. Specifically, each dequeue process updates its
<LAST_BATCH, transaction-id> value based on all deferred
transaction records it sees during a dequeue operation,
including those deferred transaction records that it does not
actually dequeue.

For example, assume that the <LAST_BATCH, transaction-id>
value for a dequeue process associated with a destination
site A is <20, 5>. During a dequeue operation, the dequeue
process encounters two deferred transaction records with
batch numbers higher than 20. The first deferred transaction
record is for a transaction TXA, has a queue batch number of
23 and must be dequeued for site A. The second deferred
transaction record is for a transaction TXB, has a queue
batch number of 25 and does not have to be dequeued for site
A. Under these circumstances, the dequeue process updates its
<LAST_BATCH, transaction-id> value to <25, TXB> after
performing the dequeue operation.
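
The example above may be traced with the following Python sketch, which is illustrative only. The function dequeue_for_site and the sites field, which lists the destinations that need a record, are hypothetical, and prepared times are omitted for brevity.

```python
from typing import List, NamedTuple, Set, Tuple


class Record(NamedTuple):
    txn_id: str
    batch: int
    sites: Set[str]   # hypothetical field: destinations that need this record


def dequeue_for_site(queue: List[Record], site: str,
                     bookmark: Tuple[int, str]
                     ) -> Tuple[List[Record], Tuple[int, str]]:
    """Dequeue the records needed by `site`, advancing the bookmark past
    every record encountered, whether or not it is dequeued."""
    last_batch, _ = bookmark
    dequeued = []
    for rec in sorted(queue, key=lambda r: r.batch):
        if rec.batch <= last_batch:
            continue                         # already seen by this process
        if site in rec.sites:
            dequeued.append(rec)
        bookmark = (rec.batch, rec.txn_id)   # updated even when skipped
    return dequeued, bookmark


queue = [Record("TXA", 23, {"A"}), Record("TXB", 25, {"B"})]
dequeued, bookmark = dequeue_for_site(queue, "A", (20, "5"))
```

Only TXA is dequeued for site A, yet the bookmark advances to <25, TXB>, so TXB does not hold back the purging process on account of site A.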

Consequently, the <LAST_BATCH, transaction-id> value for each
site will be updated according to the frequency (F1) at which
dequeue operations are performed for that site, not the
frequency (F2) at which changes are actually propagated to
that site. For sites to which changes must rarely be
propagated, F1 may be significantly greater than F2. As a
result, the delay between the time at which a deferred
transaction record has been dequeued for all necessary sites
and the time at which the deferred transaction record is
deleted from the deferred transaction queue can be
significantly shorter than it would be if the <LAST_BATCH,
transaction-id> values were only updated based on the
deferred transaction records that a dequeue process actually
dequeues.

Transaction Propagation

In replication, propagating a transaction to a destination 
site is performed by causing the destination site to execute 
operations that make at the destination site the changes made 
by the transaction at the source site. According to one 
embodiment, the source site transmits a stream of informa- 
tion to the destination site to cause such operations to be 
performed. 

Specifically, the source site sends deferred transactions to 
a destination site as a sequence of remote procedure calls, 
essentially as described in U.S. patent application Ser. No.
08/126,586 entitled "Method and Apparatus for Data 
Replication", filed on Sep. 24, 1993 by Sandeep Jain and 
Dean Daniels. Deferred transaction boundaries are marked
in the stream by special "begin-unit-of-work" and
"end-unit-of-work" tokens that contain transaction identifiers.

The destination site receives messages on the stream. 
When it receives the "begin-unit-of-work" token, the
destination site starts a local transaction for executing the
procedure calls that will follow the "begin-unit-of-work" token.
Such transactions are referred to herein as replication
transactions. A replication transaction executes the procedure
calls specified in the stream until it encounters the next
"end-unit-of-work" token. When the "end-unit-of-work" 
token is encountered, the replication transaction is finished. 
The destination site continues reading and processing 
deferred transactions using replication transactions in this 
manner until the stream is exhausted. 
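
The handling of the stream at the destination site might be sketched as follows. This Python example is a simplification under stated assumptions: messages are modeled as (kind, payload) string pairs, the "call" kind and the apply_stream function are hypothetical, and local transaction control is reduced to tracking which unit of work is open.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Message = Tuple[str, str]  # (kind, payload); this layout is an assumption


def apply_stream(stream: Iterable[Message],
                 execute: Callable[[str], None]) -> List[str]:
    """Run each unit of work as one replication transaction and return the
    identifiers of the transactions that finished."""
    finished: List[str] = []
    current: Optional[str] = None   # identifier of the open transaction
    for kind, payload in stream:
        if kind == "begin-unit-of-work":
            if current is not None:
                raise ValueError("unit of work already open")
            current = payload               # start a replication transaction
        elif kind == "end-unit-of-work":
            if payload != current:
                raise ValueError("mismatched transaction identifier")
            finished.append(payload)        # the transaction is finished
            current = None
        elif kind == "call":
            if current is None:
                raise ValueError("procedure call outside a unit of work")
            execute(payload)                # run the procedure call locally
        else:
            raise ValueError("unknown message kind: " + kind)
    return finished


stream = [("begin-unit-of-work", "T1"), ("call", "update x"),
          ("call", "update y"), ("end-unit-of-work", "T1"),
          ("begin-unit-of-work", "T2"), ("call", "insert z"),
          ("end-unit-of-work", "T2")]
executed: List[str] = []
done = apply_stream(stream, executed.append)
```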

When distributed transactions are used to perform 
replication, a replication transaction enters a "prepared" 
state when the "end-unit-of-work" token is encountered. The 
destination site informs the source site that the replication
transaction is prepared and awaits a commit instruction from
the source site. The two phase commit operation used by 
distributed transactions is described in greater detail below. 
Also described below is an alternative to the use of distrib- 
uted transactions in which the replication transaction can be 
committed immediately after it is prepared, without further 
communication with the source site. 

Dependencies Between Transactions 

After a deferred transaction record has been dequeued for 
a destination site, the changes identified in the deferred 
transaction record are propagated to the destination site. 
However, the order in which the changes were made at the 
source site places some restrictions with respect to the order 
in which the changes must be made at the destination site. 

Specifically, if a first transaction has written to a data item 
that is subsequently written to or read by a second 
transaction, then all changes made by the first transaction 
must be made at a destination site before any of the changes 
made by the second transaction are made at the destination 
site. In these circumstances, the second transaction is said to 
"depend on" the first transaction. When the second transac- 
tion merely reads the data item, the dependency is referred 
to as a write-read dependency. When the second transaction 
writes to the data item, the dependency is referred to as a 
write-write dependency. 

During replication, it is critical that the order of write- 
write dependencies be observed so that the copy of the data 
item at the destination site will reflect the correct value after 
the two writes have been applied at the destination site. It is 
desirable that the order of write-read dependencies be 
observed during replication to reduce the likelihood that the 
database at the destination site will transition through invalid
intervening states during the application of the changes at 
the destination site. 

Another type of dependency, referred to as a read-write 
dependency, exists if a first transaction reads a data item that 
is subsequently written to by a second transaction. However, 
read-write dependencies are not relevant in the context of 
replication since only updates, not reads, are propagated to 
the destination sites. 
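
The three kinds of dependency can be illustrated with a short Python sketch. The Txn structure and the dependency function are hypothetical, each transaction is reduced to the sets of data items it reads and writes, and when several kinds apply the sketch reports the one that matters most for replication.

```python
from typing import FrozenSet, NamedTuple, Optional


class Txn(NamedTuple):
    name: str
    reads: FrozenSet[str]
    writes: FrozenSet[str]


def dependency(first: Txn, second: Txn) -> Optional[str]:
    """Classify how `second`, the later transaction, depends on `first`."""
    if first.writes & second.writes:
        return "write-write"   # order must be observed during replication
    if first.writes & second.reads:
        return "write-read"    # order should be observed where possible
    if first.reads & second.writes:
        return "read-write"    # not relevant: reads are not propagated
    return None


t1 = Txn("T1", reads=frozenset(), writes=frozenset({"x"}))
t2 = Txn("T2", reads=frozenset({"x"}), writes=frozenset({"y"}))
t3 = Txn("T3", reads=frozenset(), writes=frozenset({"x"}))
kind_12 = dependency(t1, t2)
kind_13 = dependency(t1, t3)
kind_23 = dependency(t2, t3)
```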

There is a correlation between the prepared times of 
transactions and whether it is possible for a dependency to 
exist between the transactions. Specifically, transactions are 
not able to read or update any changes made by any other 
transactions until the other transactions are prepared and 
committed. Therefore, a transaction TXA cannot depend on
a transaction TXB if the prepared time of the transaction 
TXA is earlier than the prepared time of transaction TXB. 

There is also a correlation between the times that the 
deferred transaction records for transactions are written into 
the deferred transaction queue and whether it is possible for 
a dependency to exist between the transactions. Specifically, 
if every transaction acquires its prepared time as its last 
action before entering the committed state, then the deferred 
transaction record for any given transaction will never be 
written into the deferred transaction queue before the 
deferred transaction records of any transactions on which the 
given transaction depends. For example, if TXA depends on 
TXB, then it is guaranteed that the deferred transaction 
record for TXB will be written to the deferred transaction 
queue before the deferred transaction record for TXA. This 
is true because the changes made by TXB are not made 
visible to any transactions (including TXA) until the 
deferred transaction record for TXB is written to the 
deferred transaction queue.

Single-Stream Propagation

One way to ensure that changes made by a transaction are
always applied after the changes made by the transactions on
which the transaction depends is to propagate the changes in
a sequence based on the batch numbers and the prepared times
of the transactions.

Specifically, a single stream can be opened to each
destination site. Each process in charge of propagating
changes to a destination site introduces the changes into the
stream in batch order. The changes within each batch are
sorted in prepared time order so that deferred transaction
records with earlier prepared times are introduced into the
stream prior to deferred transaction records with later
prepared times. Since changes are applied at the destination
site in the order in which they arrive in the stream, the
changes made by each transaction will be made at the
destination site after the changes made by any transactions
upon which the transaction depends.

The prepared-time ordering of the deferred transaction
records may be incorporated into the dequeue process.
Specifically, the dequeue query:

    select * from queue_table
    where (queue_batch_number > last_batch)
    order by queue_batch_number, prepared_time;

will retrieve new batches of deferred transaction records
from the deferred transaction queue and order the deferred
transaction records based on batch number and prepared time.
Based on this ordering, if any transactions in a given batch
depend on each other, their changes will be transmitted in
the appropriate order. Further, as explained above, the
deferred transaction record for a transaction is always
written into the deferred transaction queue after the
deferred transaction records for the transactions on which it
depends. Therefore, it is guaranteed that subsequent batches
will not contain transactions on which any of the
transactions in the current batch depend.

Multiple-Stream Propagation

The single-stream propagation technique described above
ensures that changes will be applied at the destination sites
in the correct order. However, performance is reduced by the
fact that only one stream is used to propagate changes to
each destination site. According to one embodiment of the
invention, multiple streams are used to propagate updates to
a single destination site. Because changes sent over one
stream may be applied in any order relative to changes sent
over another stream, a scheduling mechanism is provided to
ensure that changes made by a given transaction will never
be applied prior to the changes made by transactions on
which the given transaction depends.

Referring to FIG. 4, it illustrates propagation mechanisms
400 and 402 that use multiple streams to propagate
transactions to destination sites according to an embodiment
of the invention. Propagation mechanisms 400 and 402
propagate transactions to destination sites 404 and 434,
respectively. Propagation mechanism 400 includes a scheduler
process 412, a scheduler heap 410 and three stream control
processes 414, 416 and 418. Each of stream control processes
414, 416 and 418 manages an instance of the streaming
protocol used to propagate transactions to destination site
404. Similarly, propagation mechanism 402 includes a
scheduler process 422, a scheduler heap 420 and three stream
control processes 424, 426 and 428. Each of stream control
processes 424, 426 and 428 manages an instance of the
streaming protocol used to propagate transactions to
destination site 434.

As explained above, dequeue processes 302 and 304 dequeue
deferred transaction records from deferred transaction queue
300. In the embodiment illustrated in FIG. 4, dequeue
processes 302 and 304 insert the dequeued deferred
transaction records into scheduler heaps 410 and 420,
respectively, and propagation mechanisms 400 and 402 transmit
the transactions specified in the deferred transaction
records over multiple streams to destination sites 404 and
434, respectively. Dequeue processes 302 and 304 insert the
deferred transaction records of each batch into the scheduler
heap in an order based on the prepared times of the
corresponding transactions, thus ensuring that the deferred
transaction record for any given transaction will never be
inserted into the scheduler heap before a deferred
transaction record of a transaction on which the transaction
depends.

When dequeue process 302 places a deferred transaction
record in scheduler heap 410, the deferred transaction record
is initially marked as "unsent". Scheduler process 412 is
responsible for passing the transactions associated with the
deferred transaction records in scheduler heap 410 to stream
control processes 414, 416 and 418 in a safe manner. To
ensure safety, scheduler process 412 cannot pass a
transaction to a stream control process if it is possible
that the transaction depends on a transaction that (1) has
been propagated to destination site 404 using a different
stream control process, and (2) is not known to have been
committed at the destination site 404. In addition, the
scheduler process 412 cannot pass a transaction to a stream
control process if it is possible that the transaction
depends on a transaction that has not yet been propagated to
destination site 404. According to one embodiment of the
invention, scheduler process 412 ensures the safe scheduling
of transaction propagation to destination site 404 using the
scheduling techniques illustrated in FIG. 5.

Referring to FIG. 5, it is a flow chart illustrating steps for
scheduling the propagation of transactions according to one
embodiment of the invention. At step 500, the scheduler
process 412 inspects the deferred transaction records in the
scheduler heap 410 to identify an unsent deferred transaction
record. When the scheduler process 412 encounters an unsent
deferred transaction record, scheduler process 412 determines
whether the transaction for that deferred transaction record
could possibly depend on any transaction associated with any
other deferred transaction record in the scheduler heap 410
(step 502). The determination performed by scheduler process
412 during step 502 shall be described in greater detail
below.

If the transaction associated with the deferred transaction
record could depend on any transaction associated with any
other deferred transaction record in the scheduler heap 410,
then the transaction associated with the deferred transaction
record is not passed to any stream control process and
control passes back to step 500. If the transaction
associated with the deferred transaction record could not
possibly depend on any transaction associated with any other
deferred transaction record in the scheduler heap 410, then
the transaction associated with the deferred transaction
record is passed to a stream control process at step 504. The
stream control process propagates the transaction to the
destination site 404. At step 506, the deferred transaction
record is marked as "sent". Control then passes back to step
500.

Periodically, the propagation mechanism 400 receives from
the destination site 404 messages that indicate which
transactions have been committed at the destination site. In
response to such messages, the deferred transaction records
associated with the transactions are removed from the
scheduler heap 410. The removed deferred transaction records
no longer prevent the propagation of transactions that
depended on the transactions associated with the removed
deferred transaction records.

The components of propagation mechanism 402 operate in the
same manner as the corresponding components of propagation
mechanism 400. Specifically, scheduler process 422 passes
transactions associated with unsent deferred transaction
records in scheduler heap 420 to stream control processes
424, 426 and 428 when the transaction could not possibly
depend on any transaction associated with any other deferred
transaction record in the scheduler heap 420.

For the purposes of explanation, the scheduler processes
412 and 422 and the dequeue processes 302 and 304 have been
described as separate processes. However, the actual
division of functionality between processes may vary from
implementation to implementation. For example, a single
process may be used to perform both the dequeuing and
scheduling operations for a given site. Similarly, a single
process may be used to perform the dequeuing and scheduling
operations for all destination sites. The present invention
is not limited to any particular division of functionality
between processes.

The embodiment illustrated in FIG. 4 includes three stream
control processes per destination site. However, the actual
number of stream control processes may vary from
implementation to implementation. For example, ten streams
may be maintained between each source site and each
destination site. Alternatively, ten streams may be
maintained between the source site and a destination site,
while only two streams are maintained between the source site
and a different destination site. Further, the number of
streams maintained between the source site and destination
sites may be dynamically adjusted based on factors such as
the currently available communication bandwidth.

In the embodiments described above, a transaction is not
propagated as long as the transaction may depend on one or
more transactions that are not known to have been committed
at the destination site. However, a transaction can be safely
propagated even when transactions that it may depend on are
not known to have been committed at the destination site
under certain conditions. Specifically, assume that the
scheduler process determines that a transaction TXA cannot
possibly depend on any propagated transactions that are not
known to have committed except for two transactions TXB and
TXC. If it is known that transactions TXB and TXC were
propagated in the same stream, then TXA can be safely
propagated in that same stream.

According to one embodiment of the invention, a record is
maintained to indicate which stream was used to propagate
each "sent" transaction. In this embodiment, transactions may
be propagated to a destination site when (1) all transactions
on which they may depend are known to have committed at the
destination site, or (2) all transactions on which they may
depend which are not known to have committed at the
destination site were propagated over the same stream. In the
latter case, transactions must be propagated using the same
stream as was used to propagate the transactions on which
they may depend.

Dependency Determination

As described above, scheduler processes 412 and 422 must
determine whether the transactions associated with unsent
deferred transaction records could possibly depend on any
transactions associated with the other deferred transaction
records stored in the scheduler heap. However, due to time
and space limitations, it is not practical to store a precise
representation of the true dependency relation between all
transactions.

Rather than attempt to maintain a precise representation of
actual dependencies, a database system is provided in which a
mechanism for approximating dependencies is maintained. The
approximation must be "safe" with respect to the true
dependency relation. That is, the approximation must always
indicate that a transaction TXA depends on another
transaction TXB if TXA actually depends on TXB. However, the
approximation does not have to be entirely accurate with
respect to two transactions where there is no actual
dependency. Thus, it is acceptable for there to exist
transactions TXA and TXB such that the approximation
indicates that TXA depends on TXB when TXA does not actually
depend on TXB.

A technique for such an approximation is described in U.S.
patent application Ser. No. 08/740,544, filed Oct. 29, 1996
by Swart et al. entitled "Tracking Dependencies Between
Transactions in a Database" (attorney docket no. 3018-010),
the contents of which are incorporated herein by reference.
In that technique, a "dependent time value" is computed for
each transaction. The dependent time value for a given
transaction is the maximum commit time of any transaction
that previously wrote a data item that was either read or
written by the given transaction. Using this approximation
mechanism, the determination at step 502 may be performed by
comparing the dependent time value of the transaction
associated with the unsent deferred transaction record with
the prepare times of the transactions associated with the
other deferred transaction records in the scheduler heap. If
the dependent time value is less than all prepare time
values, then the transaction cannot depend on any of the
other transactions associated with the deferred transaction
records that are currently in the scheduler heap. Otherwise,
it is possible that the transaction depends on one of the
other transactions in the scheduler heap.

Using this technique, it is possible for a transaction TXA
to be propagated before another transaction TXB even when the
approximation indicates that TXA depends on TXB. However,
this will only occur if the deferred transaction record for
TXA is inserted into the scheduler heap before the deferred
transaction record for TXB. Because the deferred transaction
records within each dequeue batch are sorted by prepare time
before being inserted into the scheduler heap, TXA could not
actually depend on TXB if the deferred transaction record for
TXA is inserted into the scheduler heap before the deferred
transaction record of TXB.

Distributed Transactions

To ensure the integrity of a database, the database must
show all of the changes made by a transaction, or none of the
changes made by the transaction. Consequently, none of the
changes made by a transaction are made permanent within a
database until the transaction has been fully executed. A
transaction is said to "commit" when the changes made by the
transaction are made permanent to the database.

According to one replication approach, the original
transaction at the source site and the propagated
transactions at destination sites are all treated as "child
transactions" of a single "distributed" transaction. To
ensure consistency, all changes made by a distributed
transaction must be made permanent at all sites if any of the
changes are made permanent at any site. The technique
typically employed to ensure this occurs is referred to as
two-phase commit.

During the first phase of two-phase commit, the process
that is coordinating the distributed transaction (the
"coordinator process") sends the child transactions to the
sites to which they correspond. In the context of
replication, the coordinator process will typically be a
process executing at the source site. The child transactions
are then executed at their respective sites. When a child
transaction is fully executed at a given site, the child
transaction is said to be "prepared". When a child
transaction is prepared at a site, a message is sent from the
site back to the coordinating process.

When all of the sites have reported that their respective
child transactions are prepared, the second phase of the two
phase commit begins. During the second phase of two phase
commit, the coordinator process sends messages to all sites
to instruct the sites to commit the child transactions. After
committing the child transactions, the sites send messages
back to the coordinating process to indicate that the child
transactions are committed. When the coordinating process has
been informed that all of the child transactions have
committed, the distributed transaction is considered to be
committed. If any child transaction fails to be prepared or
committed at any site, the coordinator process sends messages
to all of the sites to cause all child transactions to be
"rolled back", thus removing all changes by all child
transactions of the distributed transaction.
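
The two phases described above might be sketched, in greatly simplified form, as follows. The Site class is a hypothetical in-memory participant whose fail_on argument simulates a failure; message passing, timeouts and durable logging are all omitted, so this is an illustration of the control flow rather than a usable protocol implementation.

```python
from typing import List, Tuple


class Site:
    """Hypothetical in-memory participant; `fail_on` simulates a failure."""

    def __init__(self, name: str, fail_on: str = "") -> None:
        self.name = name
        self.fail_on = fail_on
        self.state = "idle"

    def prepare(self, child_txn: str) -> bool:
        # Execute the child transaction and report whether it is "prepared".
        if self.fail_on == "prepare":
            return False
        self.state = "prepared"
        return True

    def commit(self) -> bool:
        if self.fail_on == "commit":
            return False
        self.state = "committed"
        return True

    def rollback(self) -> None:
        self.state = "rolled back"


def two_phase_commit(children: List[Tuple[Site, str]]) -> bool:
    """Coordinator logic: return True if the distributed transaction commits."""
    sites = [site for site, _ in children]
    # Phase 1: send each child transaction and wait for "prepared".
    prepared = all(site.prepare(child) for site, child in children)
    # Phase 2: instruct every site to commit only if all sites are prepared.
    if prepared and all(site.commit() for site in sites):
        return True
    # Any failure to prepare or commit causes all children to be rolled back.
    for site in sites:
        site.rollback()
    return False


a, b = Site("A"), Site("B")
ok = two_phase_commit([(a, "child of T1"), (b, "child of T1")])

c, d = Site("C"), Site("D", fail_on="prepare")
failed = two_phase_commit([(c, "child of T2"), (d, "child of T2")])
```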

The advantage of implementing replication through the 
use of distributed transactions is that the distributed trans- 
actions can be successfully rolled back and reapplied as an 
atomic unit if a failure occurs during execution. However, 
performing a two phase commit imposes a significant delay 
between the completion of transactions and when the trans- 
actions are actually committed. Specifically, two round trips 
(prepare, prepared, commit, committed) are made between 
the source site and each destination site for every distributed 
transaction before the distributed transaction is committed. 
The latency imposed by these round trip messages may be 
unacceptably high. 

Replication Without Distributed Transactions 

According to an embodiment of the invention, streams of 
deferred transactions are propagated from a source site to 
one or more destination sites without the overhead of 
distributed transactions. Specifically, transactions at the
source site are committed unilaterally without waiting for 
any confirmations from the destination sites. Likewise, the 
destination sites execute and commit replication transactions 
without reporting back to the source site. To ensure database 
integrity after a failure, the source and destination sites store
information that allows the status of the replication trans- 
actions to be determined after a failure. 

According to one embodiment of the invention, each
destination site maintains an applied transactions table and
the source site maintains a durable record of which
transactions it knows to have committed at the destination
site (a "low water mark"). When a replication transaction
commits at a destination site, an entry for the replication
transaction is committed to the applied transaction table.
After a failure, the low water mark at the source site and
the information contained in the applied transactions table
at the destination site may be inspected to determine the
status of all transactions that have been propagated to the
destination site.

Specifically, if either the low water mark at the source site 
or the applied transaction table at the destination site indicates
that a transaction has been committed at the destination
site, then the changes made by that transaction do not have
to be propagated again as part of the failure recovery
process. On the other hand, if neither the low water mark at 
the source site nor the applied transaction table at the 
destination site indicates that a transaction that must be 
propagated to a destination site has been committed at the 
destination site, then the transaction will have to be propa- 
gated again as part of failure recovery. 
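
The recovery rule described above can be sketched as a small predicate. This is an illustrative sketch only, not the implementation described in the specification. The function name `needs_repropagation`, the use of integers to identify transactions, and the representation of the applied transactions table as a Python set are all assumptions made for the example. Transactions numbered below the low water mark are treated as known to have committed.

```python
def needs_repropagation(txn_seq, low_water_mark, applied_table):
    """Return True if a transaction must be sent again during recovery.

    txn_seq        -- hypothetical integer identifying the transaction
    low_water_mark -- source-side mark; every transaction below it is
                      known to have committed at the destination
    applied_table  -- set of transaction ids recorded at the destination
    """
    known_at_source = txn_seq < low_water_mark
    known_at_destination = txn_seq in applied_table
    # Either record alone is sufficient evidence of a commit.
    return not (known_at_source or known_at_destination)


applied = {33, 34, 35, 40, 53}
pending = [s for s in range(30, 42) if needs_repropagation(s, 33, applied)]
print(pending)
```

Only when neither record shows the commit does the transaction have to be propagated again.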

Purging The Scheduler Heap 

The scheduler heap does not grow indefinitely. According
to one embodiment, entries are periodically deleted from the 
scheduler heap in response to messages received at the 
source site from the destination site. The messages contain 
"committed transactions data" that indicates that one or 
more transactions that were propagated from the source site 
have been successfully executed and committed at the 
destination site. In response to receiving committed trans- 
actions data from a destination site, the entries in the 
scheduler heap for the transactions specified in the commit- 
ted transactions data are deleted from the scheduler heap. 

When entries are deleted from the scheduler heap, it is not 
necessary to immediately update the low water mark main- 
tained by the source site to indicate that the transactions 
specified in the committed transactions data were committed 
at the destination site because the applied transaction table
at the destination site already indicates that the transactions 
specified in the committed transactions data were committed 
at the destination site. Consequently, those transactions will 
not be retransmitted to the destination site after a failure. 

The committed transactions data may be, for example, the 
transaction sequence number of the last transaction that was
committed at the destination site that arrived at the destina- 
tion site on a particular stream. The scheduler keeps track of 
which transactions were sent on which streams. Since the 
transactions that are propagated on any given stream are 
processed in order at the destination site, the source site 
knows that all transactions on that particular stream that
preceded the transaction identified in the committed trans- 
action data have also been committed at the destination site. 
The entries for those transactions are deleted from the 
scheduler heap along with the entry for the transaction 
specifically identified in the committed transactions data. 
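
The purge step can be sketched as follows, assuming the scheduler records, per stream, the order in which transactions were sent. The class `SchedulerHeap`, its method names, and the string stream identifiers are hypothetical and chosen only for illustration. The specification does not prescribe a data layout.

```python
from collections import deque


class SchedulerHeap:
    """Toy model of the source-side bookkeeping (names are assumptions)."""

    def __init__(self):
        self.entries = {}   # sequence number -> transaction payload
        self.sent_on = {}   # stream id -> deque of sequence numbers, send order

    def record_send(self, stream_id, seq, payload):
        self.entries[seq] = payload
        self.sent_on.setdefault(stream_id, deque()).append(seq)

    def purge(self, stream_id, last_committed_seq):
        """Delete the identified transaction and every transaction that
        preceded it on the same stream; return the deleted numbers."""
        queue = self.sent_on.get(stream_id, deque())
        if last_committed_seq not in queue:
            return []       # stale or unknown acknowledgement: do nothing
        removed = []
        while queue:
            seq = queue.popleft()
            self.entries.pop(seq, None)
            removed.append(seq)
            if seq == last_committed_seq:
                break
        return removed


heap = SchedulerHeap()
for stream, seq in [("s1", 33), ("s2", 34), ("s1", 35), ("s1", 36)]:
    heap.record_send(stream, seq, payload=f"txn-{seq}")
removed = heap.purge("s1", 35)
print(removed, sorted(heap.entries))
```

Because each stream is processed in order at the destination, a single acknowledged sequence number is enough to release all earlier entries sent on that stream.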

Flush Tokens 

Various events may cause a destination site to transmit 
messages containing committed transactions data. For 
example, such messages may be sent when a buffer is filled 
at the destination site, or in response to "flush tokens". A 
flush token is a token sent on a stream from the source site 
to the destination site to flush the stream. 

The destination site responds to the flush token by execut- 
ing and committing all of the transactions that preceded the 
flush token on that particular stream, and by sending to the 
source site committed transaction information that indicates
which transactions that have been propagated from the 
source site on that stream have been committed at the 
destination site. As mentioned above, this committed trans- 
action information may simply identify the most recently 
committed transaction from the stream on which the flush 
token was sent. The source site knows that all transactions 
that preceded the identified transaction on the stream have 
also been committed at the destination site. 
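
The destination-side handling of a flush token might look like the sketch below. It assumes each stream item is either a `(sequence number, callable)` pair or a sentinel object standing in for the flush token. `FLUSH` and `process_stream` are hypothetical names, and the transport that returns the acknowledgement to the source site is omitted.

```python
FLUSH = object()  # hypothetical sentinel standing in for a flush token


def process_stream(items, applied_table):
    """Apply the transactions of one stream in order and yield, for each
    flush token, the most recently committed sequence number."""
    last_committed = None
    for item in items:
        if item is FLUSH:
            # Everything that preceded the token has already been
            # executed and committed, so report the latest commit.
            if last_committed is not None:
                yield last_committed
            continue
        seq, apply_changes = item
        apply_changes()            # execute the replicated changes
        applied_table.add(seq)     # recorded together with the commit
        last_committed = seq


applied = set()
stream = [(33, lambda: None), (35, lambda: None), FLUSH, (36, lambda: None)]
acks = list(process_stream(stream, applied))
print(acks, sorted(applied))
```

The single number in each acknowledgement is sufficient because in-order processing implies that every earlier transaction on the stream has committed as well.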

Updating The Low Water Mark 

The source site periodically updates the low water mark
associated with a destination site based on the committed
transaction information received from the destination site.
Various mechanisms may be used to determine a low water
mark for a destination site based on committed transaction
information received from the destination site.

For example, according to one embodiment of the
invention, an ordered list of transactions is maintained for
each stream that is being used to send deferred transactions
from a source site to a destination site. Each element in an
ordered list represents a transaction that was propagated on
the stream associated with the ordered list. The order of the
elements in the ordered list indicates the order in which the
corresponding transactions were propagated on the stream
associated with the ordered list.

As mentioned above, committed transaction information
may identify a transaction that is known to have been
committed at the destination site. In response to the committed
transaction information, a process at the source site
removes from the ordered list of the appropriate stream the
element that corresponds to the identified transaction, as
well as all preceding elements.

By truncating the ordered lists for each stream in this
manner, the low water mark may be determined by inspecting
the ordered lists for all streams to a given destination site
and identifying the oldest transaction represented on the
lists. All transactions older than that transaction have
necessarily been committed at the destination site, so data
identifying that transaction may be stored as the low water
mark for that destination site.

Purging The Applied Transaction Tables

The maintenance of an applied transaction table at every
destination site allows for accurate recovery after a failure in
a replicated environment. Further, if each applied transaction
table is allowed to grow indefinitely, then maintenance
of a low water mark at the source site is unnecessary because
the applied transaction table will reflect all of the propagated
transactions that have ever committed at the destination site.
However, an infinitely growing data structure is generally
not practical. Therefore, a mechanism is provided for periodically
purging entries from the applied transaction tables
according to one embodiment of the invention.

To purge records from an applied transaction table, the
source site sends a "purge" message to the destination site.
The purge message indicates the low water mark that is
durably stored at the source site. Upon receiving this message
from the source site, the destination site may then
delete from the applied transaction table the entries for all
transactions that are older than the transaction associated
with the low water mark.

Significantly, a purge message is not sent to the destination
site unless the low water mark specified in the purge
message has been stored on non-volatile memory at the
source site. Consequently, the applied transactions table will
always identify all transactions that (1) are above the durably
stored low water mark and (2) have been propagated to
and committed at the destination site.

The Low Water Mark Table

According to one embodiment, the source site maintains
a "low water mark table" that contains a low water mark for
each destination site. As explained above, the low water
mark for a destination site identifies a transaction T such that
every transaction that was dequeued before T is known to
have been applied and committed at that destination site.
Under certain conditions, some transactions that are later
than T in the dequeue sequence may also have committed at
the destination site, but this fact may not yet be known at the
source site.

Maintaining a low water mark table at the source site has
the benefit that after a failure, the source site only needs to
be informed about the status of transactions that are above
the low water mark.

Dequeue Sequence Numbers

According to one embodiment of the invention, each
deferred transaction record is given a dequeue sequence
number upon being dequeued. For each destination site,
dequeue sequence numbers are assigned consecutively as
transactions are dequeued. The fact that the sequence numbers
are consecutive means that a skip in the sequence
indicates the absence of a transaction, rather than just a delay
between when transactions were assigned sequence numbers.
The dequeue sequence number associated with a
transaction is propagated to the destination site with the
transaction. The destination site stores the dequeue sequence
number of a transaction in the applied transaction table entry
for the transaction.

For example, FIG. 6 illustrates a replication system in
which deferred transactions are propagated from a source
site 602 to a destination site 620 over a plurality of propagation
streams 622. The scheduler heap 604 at the source site
602 contains entries for the transactions that have been
assigned dequeue numbers 33 through 70. In the illustrated
example, all of these transactions have been propagated to
the destination site 620 over one of the propagation streams
622. Therefore, all of the entries are marked as "sent". It
should be noted that scheduler heap 604 may additionally
include entries for transactions that have not been propagated.

The low water mark stored at the source site 602 for
destination site 620 is 33. In response to a purge message
containing the low water mark 33, all entries with dequeue
numbers below 33 have been removed from
applied transaction table 650. Applied transaction table 650
currently indicates that the transactions associated with
dequeue numbers 33, 34, 35, 40 and 53, which are equal to
or above the low water mark of 33, have been committed at
the destination site 620.

Range-Based Commit Transactions Data

During recovery, a source site must determine the status
of transactions that have been propagated to each destination
site. As explained above, the status of the transactions may
be determined based on the low water marks and information
in the applied transaction tables of the destination sites.
If the applied transaction table at a destination site contains
an entry for a transaction or if the transaction falls below the
low water mark for that destination site, then the transaction
committed at the destination site prior to the failure.
Otherwise, the transaction had not committed at the destination
site prior to the failure.

Because low water marks are maintained at the source
site, the source site only needs to be informed of the
transactions that were committed at a destination site that are
above the low water mark for the destination site (the
"above-the-mark committed transactions"). Therefore, one
step in the recovery process is communicating to the source
site information that identifies the set of above-the-mark
committed transactions.

Even when low water marks are maintained at a source
site, the set of above-the-mark committed transactions may
still be huge if the low water marks were not updated
recently before the failure. Therefore, according to one
embodiment of the invention, the set of above-the-mark
committed transactions is sent from the destination site to
the source site as a series of dequeue sequence number
ranges.

According to one embodiment, the set of above-the-mark 
committed transactions is sent from the destination site to 
the source site in the form of tuples, where each tuple 
identifies a range of dequeue sequence numbers. For 
example, assume that destination site 620 had, prior to a 
failure, committed the transactions propagated from source 
site 602 with dequeue sequence numbers up to 55, with 
dequeue sequence numbers from 90 to 200, and with 
dequeue sequence numbers from 250 to 483. 

After the failure, source site 602 sends the low water mark
33 to destination site 620 to request the set of above-the-mark
committed transactions that were propagated to destination
site 620 from source site 602. In response, destination
site 620 sends back to source site 602 the tuples (55, 90),
(200, 250) and (483, -). Each tuple marks a gap between
committed ranges. With this information, the recovery
process knows that all transactions that fall strictly within the
indicated ranges (for example, 56 through 89) will have to be
re-propagated to destination
site 620. 
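
The tuple encoding in this example can be reproduced with a short sketch. It assumes the destination site already holds its committed dequeue sequence numbers as sorted, non-overlapping inclusive ranges. The function names `gap_tuples` and `in_gap`, and the use of `None` for the open-ended "-" in the last tuple, are choices made for the illustration only.

```python
def gap_tuples(committed_ranges):
    """Turn sorted inclusive committed ranges into gap tuples (a, b):
    nothing strictly between a and b has committed. The final tuple has
    b = None because the gap after the last commit is open-ended."""
    gaps = []
    for (_, prev_last), (next_first, _) in zip(committed_ranges,
                                               committed_ranges[1:]):
        gaps.append((prev_last, next_first))
    if committed_ranges:
        gaps.append((committed_ranges[-1][1], None))
    return gaps


def in_gap(seq, gaps):
    """True if the source must re-propagate the transaction numbered seq."""
    return any(lo < seq and (hi is None or seq < hi) for lo, hi in gaps)


# Committed at the destination: 33..55, 90..200 and 250..483.
gaps = gap_tuples([(33, 55), (90, 200), (250, 483)])
print(gaps)
```

The endpoints of each tuple are themselves committed, so only the numbers strictly inside a tuple are candidates for re-propagation.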

Significantly, the number of tuples that must be sent as
committed transaction information is limited to the number 
of gaps between the dequeue sequence numbers of commit- 
ted transactions, and the number of gaps is bounded by the 
size of the scheduling heap. Therefore, if the original trans- 
action heap was small enough to be stored in dynamic 
memory prior to the failure, then the committed transaction 
information should fit in the dynamic memory during recov- 
ery. 
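
The low water mark bookkeeping described earlier, under "Updating The Low Water Mark" and "Purging The Applied Transaction Tables", can also be sketched briefly. The sketch assumes one ordered list of outstanding dequeue sequence numbers per stream and a set-based applied transaction table. All function names are hypothetical. The case in which every list is empty is not addressed by the text, so the sketch simply returns `None` there.

```python
def acknowledge(ordered_lists, stream_id, committed_seq):
    """Remove the identified transaction and all preceding elements
    from the ordered list of the stream it was sent on."""
    lst = ordered_lists[stream_id]
    if committed_seq in lst:
        del lst[: lst.index(committed_seq) + 1]


def low_water_mark(ordered_lists):
    """Oldest transaction still represented on any list; every older
    transaction has necessarily committed at the destination."""
    outstanding = [seq for lst in ordered_lists.values() for seq in lst]
    return min(outstanding) if outstanding else None


def purge_applied_table(applied_table, durable_mark):
    """Destination-side purge: keep only entries at or above the mark.
    The source must have stored durable_mark on non-volatile storage
    before sending it in a purge message."""
    return {seq for seq in applied_table if seq >= durable_mark}


lists = {"a": [33, 36, 38], "b": [34, 35, 37]}
acknowledge(lists, "a", 36)
acknowledge(lists, "b", 35)
mark = low_water_mark(lists)
remaining = purge_applied_table({33, 34, 35, 36, 37}, mark)
print(mark, remaining)
```

Because the purge never removes entries at or above the durably stored mark, the two records together continue to cover every committed transaction.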

In the foregoing specification, the invention has been 
described with reference to specific embodiments thereof. It 
will, however, be evident that various modifications and 
changes may be made thereto without departing from the 
broader spirit and scope of the invention. The specification
and drawings are, accordingly, to be regarded in an illus- 
trative rather than a restrictive sense. 
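
The batch-numbering method summarized in the abstract and recited in the claims that follow can be sketched as below. This is a single-process illustration under stated assumptions, not the claimed implementation. The class `DeferredQueue`, the choice of 0 as the default batch value, and the decision to stamp new records at the start of every dequeue are assumptions. A real system would perform these steps inside database transactions, which the sketch ignores.

```python
DEFAULT_BATCH = 0  # assumed default value, older than any real batch


class DeferredQueue:
    """Toy deferred transaction queue shared by several entities."""

    def __init__(self, entities):
        self.records = []                        # dicts with data and batch
        self.batch_counter = DEFAULT_BATCH + 1   # starts above the default
        self.last_batch = {e: DEFAULT_BATCH for e in entities}

    def enqueue(self, data):
        # Each new record is marked with the default batch value.
        self.records.append({"data": data, "batch": DEFAULT_BATCH})

    def stamp_new_records(self):
        # Advance the counter, then mark every default-valued record.
        self.batch_counter += 1
        for rec in self.records:
            if rec["batch"] == DEFAULT_BATCH:
                rec["batch"] = self.batch_counter

    def dequeue_for(self, entity):
        self.stamp_new_records()
        last = self.last_batch[entity]
        batch = [r for r in self.records if r["batch"] > last]
        if batch:
            self.last_batch[entity] = max(r["batch"] for r in batch)
        return [r["data"] for r in batch]


q = DeferredQueue(["siteA", "siteB"])
q.enqueue("t1")
q.enqueue("t2")
first_a = q.dequeue_for("siteA")
q.enqueue("t3")
first_b = q.dequeue_for("siteB")
second_a = q.dequeue_for("siteA")
print(first_a, first_b, second_a)
```

Each entity sees every record exactly once, because a record is selected only while its batch value is more recent than the last batch value stored for that entity.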

What is claimed is: 

1. A method for processing records that belong to a set of 
records, where records are repeatedly being added to said set 
of records, where each record in said set of records has to be 
processed once for each of a plurality of entities, the method 
comprising the steps of: 

marking each record that is added to said set of records 

with a default batch value; 
for each entity of said plurality of entities, processing a 
batch of said records by performing the steps of: 
reading a last batch value associated with said entity; 
processing the records in said set of records that are 
marked with batch values that are more recent than 
said last batch value associated with said entity; and 
updating the last batch value associated with said entity 
to a most recent batch value of the records processed 
for said entity; 
between processing consecutive batches for an entity of 
said plurality of entities, marking said set of records by 
performing the steps of: 

updating a batch counter value to reflect a more recent
batch number; and
marking all records in said set of records that have said
default batch value with said batch counter value.

2. The method of claim 1 wherein: 

the plurality of entities is a plurality of destination sites;

the set of records is a deferred transaction queue where
each record corresponds to a set of changes to be 
propagated to one or more of said plurality of destina- 
tion sites; 

the step of marking each record that is added to said set 
of records with a default batch value includes marking 
each record that is added to said deferred transaction 
queue with a default batch value; 

the step of reading a last batch value associated with said 
entity includes the step of reading a last batch value 
associated with a destination site of said plurality of 
destination sites; 

the step of processing the records in said set that are 
marked with batch values that are more recent than said 
last batch value associated with said entity includes the 
step of dequeuing for said destination site the records in 
said set that are marked with batch values that are more 
recent than said last batch value associated with said 
destination site; and 

the step of updating the last batch value associated with 
said entity to a most recent batch value of the records 
processed for said entity includes the step of updating 
the last batch value associated with said destination site 
to a highest batch value of the records dequeued for 
said destination site. 

3. The method of claim 2 wherein: 

said deferred transaction queue comprises a table; and 
the step of dequeuing is performed by executing a query 
on said table. 

4. The method of claim 2 wherein: 

each record includes a prepared time; and 

the step of processing a batch of said record further
includes the step of sorting said records in said batch
based on said prepared time.

5. The method of claim 1 wherein the step of marking said 
set of records is performed every time the step of processing
a batch of said records is performed. 

6. The method of claim 1 wherein the step of processing 
a batch of said records is performed by a different process 
for each of said plurality of entities. 

7. The method of claim 1 further comprising the step of 
initializing said batch counter value to a value that is greater 
than said default batch value. 

8. A method for processing records that have been added 
to a set of records, where records are repeatedly being added 
to said set of records, where each record in said set of 
records has to be processed once for each of a plurality of 
entities, the method comprising the steps of: 

assigning each record that is added to said set of records 
to a temporary batch that is marked older than any prior 
batch; 

for each entity of said plurality of entities, processing said 
records by repeatedly performing the steps of: 
determining a last batch processed for said entity; and 
processing the records in all batches that are newer than 
said last batch processed for said entity; and 

between processing said records twice for an entity of said 
plurality of entities, marking said temporary batch as
newer than any prior batch. 

9. The method of claim 8 wherein: 

the step of assigning each record that is added to said set 
of records to a temporary batch that is marked older 
than any prior batch includes the step of adding a 
default batch number to each record; 

the step of determining a last batch processed for said 
entity includes the step of reading a last batch number 
for said entity;

the step of processing the records in all batches that are
newer than said last batch processed for said entity
includes the step of processing all records that have
been marked with batch numbers that are more recent
than the last batch number for said entity.

10. The method of claim 9 further comprising the steps of: 
maintaining a batch number counter; 

wherein the step of marking said temporary batch as 
newer than any prior batch includes the steps of: 
incrementing the batch number counter; and 
assigning the batch number counter to all records that 
currently contain the default batch number. 

11. The method of claim 8 wherein: 

the plurality of entities are destination sites to which 
changes must be replicated; and 

the step of processing said records for each of said entities 
includes, for each destination site, repeatedly perform- 
ing the steps of: 

determining a last batch processed for said destination 
site; and 

processing the records in all batches that are newer than 
said last batch processed for said destination site. 

12. The method of claim 11 wherein: 

the step of processing the records in all batches that are
newer than said last batch processed for said destina- 
tion site includes placing in a data container entries for 
transactions associated with said records; and 

the method further comprises the step of propagating to 
the destination site changes made by the transactions
represented by said entries in the data container. 

13. A computer-readable medium having stored thereon 
sequences of instructions for processing records that have 
been added to a set of records, where records are repeatedly 
being added to said set of records, where each record in said
set of records has to be processed once for each of a plurality
of entities, the sequences of instructions including sequences
of instructions for performing the steps of: 

assigning each record that is added to said set of records 
to a temporary batch that is marked older than any prior
batch; 

for each of said entities, processing said records by 

repeatedly performing the steps of: 

determining a last batch processed for said entity; and 

processing the records in all batches that are newer than 
said last batch processed for said entity; and 
between processing said records twice for an entity of said 

plurality of entities, marking said temporary batch as 

newer than any prior batch. 

14. The computer-readable medium of claim 13 wherein: 
the step of assigning each record that is added to said set 

of records to a temporary batch that is marked older 
than any prior batch includes the step of adding a 
default batch number to each record; 

the step of determining a last batch processed for said 
entity includes the step of reading a last batch number 
for said entity; 

the step of processing the records in all batches that are 
newer than said last batch processed for said entity
includes the step of processing all records that have 
been marked with batch numbers that are more recent 
than the last batch number for said entity. 

15. The computer-readable medium of claim 14 further 
comprising sequences of instructions for performing the
steps of: 

maintaining a batch number counter;

wherein the step of marking said temporary batch as
newer than any prior batch includes the steps of: 
incrementing the batch number counter; and 
assigning the batch number counter to all records that 
currently contain the default batch number. 

16. The computer-readable medium of claim 13 wherein: 
the plurality of entities are destination sites to which 

changes must be replicated; and 
the step of processing said records for each of said entities 
includes, for each destination site, repeatedly perform- 
ing the steps of: 

determining a last batch processed for said destination 
site; and 

processing the records in all batches that are newer than 
said last batch processed for said destination site. 

17. The computer-readable medium of claim 16 wherein: 
the step of processing the records in all batches that are 

newer than said last batch processed for said destina- 
tion site includes placing in a data structure entries for 
transactions associated with said records; and 
the computer-readable medium further comprises 
sequences of instructions for performing the step of 
propagating to the destination site changes made by the 
transactions represented by said entries in the data 
structure. 

18. A method for processing records that have been added 
to a set of records, where records are repeatedly being added 
to said set of records, where each record in said set of 
records has to be processed once for each of a plurality of 
entities, the method comprising the steps of: 

assigning each record that is added to said set of records 
to a temporary batch that is marked older than any prior 
batch; 

processing for an entity of said plurality of entities the 
records in all batches that are newer than a last batch 
processed for said entity; and 

before processing said records again for said entity, mark- 
ing said temporary batch as newer than any prior batch. 

19. The method of claim 18 wherein: 

the plurality of entities are destination sites to which 
changes must be replicated; and 

the step of processing for said entity the records in all 
batches that are newer than a last batch processed for 
said entity includes repeatedly performing the steps of: 
determining a last batch processed for a destination 
site; and 

processing the records in all batches that are newer than 
said last batch processed for said destination site. 

20. A computer-readable medium having stored thereon
sequences of instructions for processing records that belong 
to a set of records, where records are repeatedly being added 
to said set of records, where each record in said set of 
records has to be processed once for each of a plurality of 
entities, the sequences of instructions including sequences of 
instructions for performing the steps of: 

marking each record that is added to said set of records 

with a default batch value; 
for each entity of said plurality of entities, processing a
batch of said records by performing the steps of: 
reading a last batch value associated with said entity; 
processing the records in said set of records that are 

marked with batch values that are more recent than 

said last batch value associated with said entity; and 
updating the last batch value associated with said entity 

to a most recent batch value of the records processed 

for said entity;
between processing consecutive batches for an entity of
said plurality of entities, marking said set of records by 
performing the steps of: 

updating a batch counter value to reflect a more recent 
batch number; and

marking all records in said set of records that have said 
default batch value with said batch counter value. 

21. The computer-readable medium of claim 20 wherein: 
the plurality of entities is a plurality of destination sites; 

the set of records is a deferred transaction queue where 
each record corresponds to a set of changes to be 
propagated to one or more of said plurality of 
destination sites; 

the step of marking each record that is added to said set 
of records with a default batch value includes marking
each record that is added to said deferred transaction
queue with a default batch value;

the step of reading a last batch value associated with 
said entity includes the step of reading a last batch 
value associated with a destination site of said plu- 
rality of destination sites; 

the step of processing the records in said set that are 
marked with batch values that are more recent than 
said last batch value associated with said entity 
includes the step of dequeuing for said destination 
site the records in said set that are marked with batch 
values that are more recent than said last batch value 
associated with said destination site; and 

the step of updating the last batch value associated with 
said entity to a most recent batch value of the records 
processed for said entity includes the step of updat- 
ing the last batch value associated with said desti- 
nation site to a highest batch value of the records 
dequeued for said destination site. 

22. The computer-readable medium of claim 21 wherein:
said deferred transaction queue comprises a table; and 

the step of dequeuing is performed by executing a query 
on said table. 

23. The computer-readable medium of claim 21 wherein: 
each record includes a prepared time; and 

the step of processing a batch of said record further 
includes the step of sorting said records in said batch 
based on said prepared time.

24. The computer-readable medium of claim 20 wherein
the step of marking said set of records is performed every 
time the step of processing a batch of said records is
performed. 

25. The computer-readable medium of claim 20 wherein 
the step of processing a batch of said records is performed 
by a different process for each of said plurality of entities. 

26. The computer-readable medium of claim 20 further 
comprising instructions for performing the step of initializ- 
ing said batch counter value to a value that is greater than 
said default batch value. 

27. A system for processing records, the system compris- 
ing: 

a plurality of entities; 

a scheduler heap containing a set of records, wherein 
records are repeatedly being added to said set of 
records; 

a propagation mechanism, wherein the propagation 
mechanism is configured to process each record in said 
set of records, wherein each record in said set of 
records is processed once for each of the plurality of 
entities by performing the steps of: 
marking each record that is added to said set of records 

with a default batch value; 
for each entity of said plurality of entities, processing 
a batch of said records by performing the steps of: 
reading a last batch value associated with said entity; 
processing the records in said set of records that are 
marked with batch values that are more recent 
than said last batch value associated with said 
entity; and 

updating the last batch value associated with said 
entity to a most recent batch value of the records 
processed for said entity; 
between processing consecutive batches for an entity of 
said plurality of entities, marking said set of records 
by performing the steps of: 
updating a batch counter value to reflect a more recent 

batch number; and 
marking all records in said set of records that have said 
default batch value with said batch counter value.